背景和目标:现有的医学图像分割的深度学习平台主要集中于完全监督的细分,该分段假设可以使用充分而准确的像素级注释。我们旨在开发一种新的深度学习工具包,以支持对医学图像分割的注释有效学习,该学习可以加速并简单地开发具有有限注释预算的深度学习模型,例如,从部分,稀疏或嘈杂的注释中学习。方法:我们提出的名为Pymic的工具包是用于医学图像分割任务的模块化深度学习平台。除了支持开发高性能模型以进行全面监督分割的基本组件外,它还包含几个高级组件,这些高级组件是针对从不完善的注释中学习的几个高级组件,例如加载带注释和未经通知的图像,未经通知的,部分或无效的注释图像的损失功能,以及多个网络之间共同学习的培训程序。Pymic构建了Pytorch框架,并支持半监督,弱监督和噪声的学习方法用于医学图像分割。结果:我们介绍了基于PYMIC的四个说明性医学图像细分任务:(1)在完全监督的学习上实现竞争性能; (2)半监督心脏结构分割,只有10%的训练图像; (3)使用涂鸦注释弱监督的分割; (4)从嘈杂的标签中学习以进行胸部X光片分割。结论:Pymic工具包易于使用,并促进具有不完美注释的医学图像分割模型的有效开发。它是模块化和灵活的,它使研究人员能够开发出低注释成本的高性能模型。源代码可在以下网址获得:https://github.com/hilab-git/pymic。
translated by 谷歌翻译
卷积神经网络(CNN)已经实现了医学图像细分的最先进性能,但需要大量的手动注释进行培训。半监督学习(SSL)方法有望减少注释的要求,但是当数据集大小和注释图像的数量较小时,它们的性能仍然受到限制。利用具有类似解剖结构的现有注释数据集来协助培训,这有可能改善模型的性能。然而,由于目标结构的外观不同甚至成像方式,跨解剖结构域的转移进一步挑战。为了解决这个问题,我们提出了跨解剖结构域适应(CS-CADA)的对比度半监督学习,该学习适应一个模型以在目标结构域中细分相似的结构,这仅需要通过利用一组现有现有的现有的目标域中的限制注释源域中相似结构的注释图像。我们使用特定领域的批归归量表(DSBN)来单独地标准化两个解剖域的特征图,并提出跨域对比度学习策略,以鼓励提取域不变特征。它们被整合到一个自我兼容的均值老师(SE-MT)框架中,以利用具有预测一致性约束的未标记的目标域图像。广泛的实验表明,我们的CS-CADA能够解决具有挑战性的跨解剖结构域移位问题,从而在视网膜血管图像和心脏MR图像的帮助下,在X射线图像中准确分割冠状动脉,并借助底底图像,分别仅给定目标域中的少量注释。
translated by 谷歌翻译
在临床实践中,由于存储成本和隐私限制,通常需要进行分割网络在多个站点而不是合并集的顺序数据流上不断学习。但是,在持续学习过程中,现有方法通常在以前的网站上的网络记忆性或看不见的站点上的概括性中受到限制。本文旨在解决同步记忆性和概括性(SMG)的挑战性问题,并使用新颖的SMG学习框架同时提高以前和看不见的地点的性能。首先,我们提出一个同步梯度对准(SGA)目标,\ emph {不仅}通过对先前站点(称为重播缓冲区)的小型示例进行协调优化,从而促进网络的记忆力,\ emph {but emph {又增强了}的增强。通过促进模拟域移位下的现场不变性来概括。其次,为了简化SGA目标的优化,我们设计了一种双META算法,该算法将SGA目标近似为双元目标,以优化,而无需昂贵的计算开销。第三,为了有效的排练,我们全面考虑了重播缓冲区,以考虑额外的地点多样性以降低冗余。从六个机构中依次获得的前列腺MRI数据实验表明,我们的方法可以同时获得更高的记忆性和对最先进方法的可推广性。代码可在https://github.com/jingyzhang/smg-learning上找到。
translated by 谷歌翻译
随着面部伪造技术的快速发展,由于安全问题,伪造的检测引起了越来越多的关注。现有方法尝试使用频率信息在高质量的锻造面上进行微妙的伪影。然而,频率信息的开发是粗糙的,更重要的是,他们的香草学习过程努力提取细粒度的伪造痕迹。为了解决这个问题,我们提出了一个渐进式增强学习框架来利用RGB和细粒度的频率线索。具体而言,我们对RGB图像进行细粒度分解,以在频率空间中完全删除真实的迹线和虚假的迹线。随后,我们提出了一种基于双分支网络的渐进式增强学习框架,结合自增强和互增强模块。自增强模块基于空间噪声增强和渠道注意,捕获不同输入空间中的迹线。通过在共享空间维度中通信,互增强模块同时增强RGB和频率特征。逐步增强过程有助于学习具有细粒面的伪造线索的歧视特征。在多个数据集上进行广泛的实验表明我们的方法优于最先进的面部伪造检测方法。
translated by 谷歌翻译
深度神经网络通常需要准确和大量注释,以在医学图像分割中实现出色的性能。单次分割和弱监督学习是有前途的研究方向,即通过仅从一个注释图像学习新类并利用粗标签来降低标签努力。以前的作品通常未能利用解剖结构并遭受阶级不平衡和低对比度问题。因此,我们为3D医学图像分割的创新框架提供了一次性和弱监督的设置。首先,提出了一种传播重建网络,以基于不同人体中的解剖模式类似的假设将来自注释体积的划痕投射到未标记的3D图像。然后,双级功能去噪模块旨在基于解剖结构和像素级别来改进涂鸦。在将涂鸦扩展到伪掩码后,我们可以使用嘈杂的标签培训策略培训新课程的分段模型。一个腹部的实验和一个头部和颈部CT数据集显示所提出的方法对最先进的方法获得显着改善,即使在严重的阶级不平衡和低对比度下也能够稳健地执行。
translated by 谷歌翻译
Large training data and expensive model tweaking are standard features of deep learning for images. As a result, data owners often utilize cloud resources to develop large-scale complex models, which raises privacy concerns. Existing solutions are either too expensive to be practical or do not sufficiently protect the confidentiality of data and models. In this paper, we study and compare novel \emph{image disguising} mechanisms, DisguisedNets and InstaHide, aiming to achieve a better trade-off among the level of protection for outsourced DNN model training, the expenses, and the utility of data. DisguisedNets are novel combinations of image blocktization, block-level random permutation, and two block-level secure transformations: random multidimensional projection (RMT) and AES pixel-level encryption (AES). InstaHide is an image mixup and random pixel flipping technique \cite{huang20}. We have analyzed and evaluated them under a multi-level threat model. RMT provides a better security guarantee than InstaHide, under the Level-1 adversarial knowledge with well-preserved model quality. In contrast, AES provides a security guarantee under the Level-2 adversarial knowledge, but it may affect model quality more. The unique features of image disguising also help us to protect models from model-targeted attacks. We have done an extensive experimental evaluation to understand how these methods work in different settings for different datasets.
translated by 谷歌翻译
A storyboard is a roadmap for video creation which consists of shot-by-shot images to visualize key plots in a text synopsis. Creating video storyboards however remains challenging which not only requires association between high-level texts and images, but also demands for long-term reasoning to make transitions smooth across shots. In this paper, we propose a new task called Text synopsis to Video Storyboard (TeViS) which aims to retrieve an ordered sequence of images to visualize the text synopsis. We construct a MovieNet-TeViS benchmark based on the public MovieNet dataset. It contains 10K text synopses each paired with keyframes that are manually selected from corresponding movies by considering both relevance and cinematic coherence. We also present an encoder-decoder baseline for the task. The model uses a pretrained vision-and-language model to improve high-level text-image matching. To improve coherence in long-term shots, we further propose to pre-train the decoder on large-scale movie frames without text. Experimental results demonstrate that our proposed model significantly outperforms other models to create text-relevant and coherent storyboards. Nevertheless, there is still a large gap compared to human performance suggesting room for promising future work.
translated by 谷歌翻译
Solving real-world optimal control problems are challenging tasks, as the system dynamics can be highly non-linear or including nonconvex objectives and constraints, while in some cases the dynamics are unknown, making it hard to numerically solve the optimal control actions. To deal with such modeling and computation challenges, in this paper, we integrate Neural Networks with the Pontryagin's Minimum Principle (PMP), and propose a computationally efficient framework NN-PMP. The resulting controller can be implemented for systems with unknown and complex dynamics. It can not only utilize the accurate surrogate models parameterized by neural networks, but also efficiently recover the optimality conditions along with the optimal action sequences via PMP conditions. A toy example on a nonlinear Martian Base operation along with a real-world lossy energy storage arbitrage example demonstrates our proposed NN-PMP is a general and versatile computation tool for finding optimal solutions. Compared with solutions provided by the numerical optimization solver with approximated linear dynamics, NN-PMP achieves more efficient system modeling and higher performance in terms of control objectives.
translated by 谷歌翻译
The task of reconstructing 3D human motion has wideranging applications. The gold standard Motion capture (MoCap) systems are accurate but inaccessible to the general public due to their cost, hardware and space constraints. In contrast, monocular human mesh recovery (HMR) methods are much more accessible than MoCap as they take single-view videos as inputs. Replacing the multi-view Mo- Cap systems with a monocular HMR method would break the current barriers to collecting accurate 3D motion thus making exciting applications like motion analysis and motiondriven animation accessible to the general public. However, performance of existing HMR methods degrade when the video contains challenging and dynamic motion that is not in existing MoCap datasets used for training. This reduces its appeal as dynamic motion is frequently the target in 3D motion recovery in the aforementioned applications. Our study aims to bridge the gap between monocular HMR and multi-view MoCap systems by leveraging information shared across multiple video instances of the same action. We introduce the Neural Motion (NeMo) field. It is optimized to represent the underlying 3D motions across a set of videos of the same action. Empirically, we show that NeMo can recover 3D motion in sports using videos from the Penn Action dataset, where NeMo outperforms existing HMR methods in terms of 2D keypoint detection. To further validate NeMo using 3D metrics, we collected a small MoCap dataset mimicking actions in Penn Action,and show that NeMo achieves better 3D reconstruction compared to various baselines.
translated by 谷歌翻译
A major goal of multimodal research is to improve machine understanding of images and text. Tasks include image captioning, text-to-image generation, and vision-language representation learning. So far, research has focused on the relationships between images and text. For example, captioning models attempt to understand the semantics of images which are then transformed into text. An important question is: which annotation reflects best a deep understanding of image content? Similarly, given a text, what is the best image that can present the semantics of the text? In this work, we argue that the best text or caption for a given image is the text which would generate the image which is the most similar to that image. Likewise, the best image for a given text is the image that results in the caption which is best aligned with the original text. To this end, we propose a unified framework that includes both a text-to-image generative model and an image-to-text generative model. Extensive experiments validate our approach.
translated by 谷歌翻译